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We present a new multigrid method called neural multigrid which is based on joining 
multigrid ideas with concepts from neural nets. The main idea is to use the Greenbaum 
criterion as a cost functional for the neural net. The algorithm is able to learn efficient 
interpolation operators in the case of the ordered Laplace equation with only a very 
small critical slowing down and with a surprisingly small amount of work comparable to 
that of a Conjugate Gradient solver. 

In the case of the two-dimensional Laplace equation with SU(2) gauge fields at 
/3 = the learning exhibits critical slowing down with an exponent of about z 0.4. 
The algorithm is able to find quite good interpolation operators in this case as well. 
Thereby it is proven that a practical true multigrid algorithm exists even for a gauge 
theory. An improved algorithm using dynamical blocks tthat will hopefully overcome 
the critical slowing down completely is sketched. 

Keywords: Discretized differential equations, multigrid, neural nets, disordered systems, 
lattice gauge theory 

1. Introduction 

Multigrid methods are among the most successful strategies for solving discretized 
differential equations. In the presence of disorder, which is the case of most interest 
to contemporary physics, the development of methods that are able to deal with 
the lost translational invariance of the system has proven extremely difficult. As an 
example we may look at the efforts to find a multigrid solver for the Dirac equation 



has been found that is able to deal with this problem as efficiently as we would like, 
although there is a unigrid method called ISU that has been proven to work in two 
dimensionsB. 

The development of the ISU algorithm has been triggered by an attempt to con- 
nect multiscale ideas with the method of neural nets a. A more direct combination 
of multigrid and neural net methods was based on the idea to learn the shapes of 
slow- converging modes by a standard back-propagation scheme, using the known 
errors of test problems as targets. This, however, was shown to be not feasible S 
The main reason was that the cost functional possessed only a very narrow mini- 




Up to now, no true multigrid algorithm 
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mum, which was difficult to find by the neural net whenever the criticality of the 
problem operator (i.e. its condition number) was large. 

In this work we retain the idea of using test problems to do the learning, but we 
do not try to learn them as in a pattern recognition neural network. Instead of this, 
we borrow ideas from the ISU algorithm!!! and the principle of indirect elimination!! 

2. The problem 

Consider a linear operator D which may arise from a discretized differential equation 
defined on a cubic lattice A . "Here and in the following we assume D to be positive 
definite, if it were not, we could use the operator D*D instead. The general form 
of the equation to be solved is then 

D£ = f . (1) 

It is well known that standard solvers like Conjugate Gradient or Overrelaxation 
show the phenomenon of critical slowing down: The number of iterations needed to 
solve the equation with a given precision scales with some power of the condition 
number (quotient of the largest and smallest eigenvalue) . 

At each time-step, any iterative method will yield an approximate solution £. 
We introduce two important quantities: the error e = £ — £ which is the difference 
between the true and the actual solution and is of course not known, and the 
residual r = f D£, the difference between the true and the actual righthandside. 
With these definitions we can recast the fundamental equation (Q) as 

De = r , (2) 

called the error equation. 

For a linear method, we can also introduce the iteration matrix S which tells us 
what the new error after the next iteration step will be, given the old one: 

e" = S e old . (3) 

3. The Neural Multigrid 

Although our method is motivated by ideas borrowed from neural network meth- 
ods, a thorough understanding of these methods is_nnt necessary. A good introduc- 
tory text is For introductions to multigrid see ErtrEl. 

The basic setup of a multigrid algorithm uses auxiliary lattices, also called block 
lattices or coarse grids, A 1 , A 2 , ... , A N with lattice spacings aj — L J b ao, where Lb is 
the blocking factor and is usually chosen to be 2. The last lattice A N should contain 
few enough points that a direct solution of any equation living on this lattice is easy. 

a It is not necessary to restrict our attention to the case of a cubic lattice, but it eases the imple- 
mentation of the method. 
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Let TiP be the space of functions on lattice A J . Then we introduce grid transfer 
operators: 

the interpolation operators : A k : TL k+1 \— * 7i k and (4) 
the restriction operators : C k : H k h-> H k+1 , (5) 

with k < N. So operators A k interpolate from a coarser to a finer grid, whereas 
the restriction operators do the reverse. For reasons of efficiency these operators 
are not allowed to interpolate from one block grid point to all of the fine grid: if 
we identify a block grid point x with its corresponding fine grid point z we must 
require A k (z, x) = unless x lies near z. We will always choose the operators to 
be adjoints of each other: C k — A k * . We also recursively define effective operators 
D fc = G l ~D k ~ x A x , of course setting D° = D to stop the recursion. 

Multigrid methods are based on the observation that standard relaxation al- 
gorithmalaO usually smoothen the error, or in more general terms, project onto 
the lower half of the eigenspectrum of the problem operator D°. If we can find 
interpolation operators A such that the smooth error lies within their range, we 
can write e° w A°e 1 . Inserting this into the error equation (||) we find 

r° (6) 
C°r° (7) 
r 1 , (8) 

involving only quantities on the block lattice. Solving this equation and interpolat- 
ing back yields a good estimator for the error e°. The block-equation (§) can be 
solved recursively by going to a still coarser grid, until we reach the coarsest layer 
A N where the equation can be solved directly. 

After this review of multigrid methods let us now ^et up the neural multigrid. 
The basic fact we are using is the Greenbaum criterionH, which up to now has been 
considered to have no practical use at all. It states that optimal convergence of the 
multigrid is achieved when the interpolation operators are able to represent all the 
modes of the system that are slowly converging under the used relaxation process. 
In other words, all the highest eigenmodes of the iteration matrix of the relaxation 
process have to lie within the range of the interpolation operators. Stated like this 
it is clear why the practical value of this principle is small: How should we know 
all of the bad-converging modes of the system? 

To learn these modes we are going to use test problems: A test problem is 
a problem with a known solution, chosen such that the initial error is a random 
function on the lattice. This can be done by drawing the exact solution £ from 
a random distribution, calculating the test problem's righthandside f = D£ and 
starting with the initial guess £ = 0. 

A first idea to exploit the criterion uses the fact that indeed we have a good pro- 
jector onto the bad-converging modes, namely the relaxation process itself. There- 
fore it would be possible to start with a test problem, do some relaxations and 



D A e = 
C°T>°A°e 1 = 
D^ 1 = 
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thereby project the initial error onto the space of the slow-converging modes. This 
error could then be learned (in some way) by the neural multigrid. 

Of course this method is doomed to fail: To exhaust the complete space of the 
slow modes would take a very long time because the projection on the worst modes 
will only be efficient after many relaxation sweeps. This means that the projection 
method itself suffers from critical slowing down, and therefore our neural multigrid 
will as well. 

Nevertheless, our new method is based on this idea, but with another ingredient, 
which is similar to the spirit of ISU: After a few relaxation steps, the remaining 
error is something that should definitely be learned. So we adapt our interpolation 
operators to this error (the details will be explained later on), and then use the 
newly learned interpolation operators to further reduce the error. To do so, we 
can make a coarse-grid correction step. (For the time being, think of the method 
as a twogrid- method, i.e. N = 1.) This then efficiently removes all those error 
components already learned. We then relax again and start the learning process 
anew (of course using everything we have already learned) . 

It is easy to see why this avoids the pitfall described above: After the first 
relaxation step the error will contain contributions from all slow modes. Some of 
these will be learned by the multigrid and are therefore removed. In this way we 
successively project out everything that is not yet learned and then learn it. After 
each learning step, we measure the convergence rate of the algorithm as it stands 
now by solving another test problem. If this is sufficient (for instance, if the error is 
reduced by a factor of 2 during each multigrid cycle), we stop the learning procedure, 
otherwise we continue. 

It may happen that one coarse-grid correction step is not enough to project the 
error onto the nullspace of the interpolation operators. In this case we may use 
the error e« of the measurement iteration for the next learning step by setting 
f = DeM. This was the method actually used to obtain the results shown below. 

4. The learning process 

Let us now look at the method in greater detail: How is the learning actually 
done? 

The first thing in setting up a neural network is to decide on a cost functional 
that decides how good a certain network configuration is. As we want to have the 
error after relaxation within the range of the interpolation operators, we choose as 
functional 

|| e ||2 . W 

where A is the interpolation operator, e is the actual error and <5e is chosen such as 
to minimize E with the given interpolation operators, at least approximately. Se is 
the coarse-grid correction we would get using these interpolation operators, so it is 
a function living on the coarse grid. With x being a coarse grid point and z a fine 
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grid point, we can rewrite E using indices as 

There are two possibilities to use this cost functional: The standard way in neural 
networks would be to calculate the derivatives dE / dA(z,x) and do an adjustment 
step in direction of this gradient, perhaps also including some momentum termliJ. 

However, we can do better than this: Remembering that our interpolation op- 
erators are localized objects, we see that at each point the derivative only depends 
on very few of the interpolation operators. We can therefore minimize E exactly at 
each point so that the actual error lies within the range of the interpolation opera- 
tors. To do so, we require dE/dA — at each point. This is possible only because 
the space of interpolation operators is larger than the space of fine grid functions, 
as most of the fine grid points are covered by more than one interpolation operator. 
In fact, the system of equations dE/dA = is under- determined because of this. 
This is an advantage as we can adjust the interpolation operators and still retain 
some memory of their old values. 

To describe the rules of the learning, let us define the difference vector d(z) = 
^2 X A{z, x) 5e(x) — e(z), which is nothing but the error after the correction step 
using Se. By SA{z, x) we denote the change in A at the specified point. The 
condition to be fulfilled is therefore that d(z) would be zero after the change of A: 

^2(A(z,x) +5A(z,x))5e(x) -e(z) = . (11) 

X 

As this system of equations is under-determined, we may add another require- 
ment. In order to keep some memory of already learned things, we require that 
J2 X z &A(z, x ) 2 = mm 7 so we f° r tn at choice of A that is closest to the old A. 

To solve these equations we can use the method of Lagrangian multipliers (min- 
imizing the change in A and using eq. ( |ll|) as a constraint) and we find 

SMz x) - d(z) Se ( x) (12) 

Here the condition means that we have to use all those points x' from which 

the interpolation operator can reach the point z. 

Finally, let us remark on the choice of Se: The obvious method to determine 
it would be to use a full coarse-grid-correction step. A somewhat simpler (and 
cheaper) choice is to choose Se such that the error is put to zero at those fine grid 
points only reached by one interpolation operator, i.e. in the center of each block. 
This should result in a good approximation of the optimal choice of Se and is much 
cheaper. Up to now, this method has been used. 
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5. The true multigrid 

For a true multigrid, all we have to do is to use the algorithm above recursively. 
Whenever a coarse-grid-correction is required, we try to solve the coarse grid equa- 
tion, monitoring the convergence. If the convergence is not good enough, we set up 
a test problem on the coarse layer, learn good interpolation operators on this layer 
and then solve the coarse grid equation again. 

This looks like a very expensive method, for the average number of learning 
cycles required on a certain layer will quickly blow up the coarser the layer is. If 
we need only two learning steps in each learning process the method will proceed 
through the multigrid in a W-cycle fashion — more learning steps are analogous to 
higher cycles. In two dimensions, the maximal cycle index allowed is four, so we 
will quickly reach a limiting point where learning becomes extremely slow and has 
some kind of critical slowing down. 

Two facts oppose this tendency for the process to get slow: First of all, as long 
as the interpolation operators are not yet very good, the coarse grid equation will 
not be as critical as the fine grid equation, so solving it is easier. Secondly, as 
the changes on each layer are gradual, not every change on a fine layer will need 
adaptations on coarser layers. 

Of course, a definite answer can only be found by putting the method to the 
test. 

6. Possible improvements 

6.1. Adding an indirect elimination 

A very simple improvement that can be added to the algorithm is to introduce an 
update based on the principle of indirect elimination: If the convergence rate is 
not satisfactory, we store the error after a measurement and do a line search in the 
direction of this error vector before each multigrid cycle. As the error after the 
measurement corresponds (at least approximately) to the worst-converging mode 
of the algorithm, the line search will take care of this mode, so that it will not 
contribute to the error in the following multigrid stecs. A thorough explanation of 
the principle of indirect elimination can be found in □. 

6.2. Adding a memory 

Another possibility is to store some of the errors learned and to relearn them in 
later learning cycles. As we exactly minimize the cost function in each step, we 
may forget some of the error shapes learned earlier. If we store these errors, we 
can use them again. In this way we attempt to find interpolation operators that 
are able to deal with all the learned error shapes. It is well-known from the neural 
net context that a shape once learned might be forgotten later on if it is not shown 
again to the network. To be more precise, what we add to the algorithm are the 
following steps: 
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Instead of the simple adaptation step we have 

do nrOfMemorySteps times: 

Adapt to the actual error 

Adapt to all memorized errors 
Adapt to the actual error again 

In practice, we usually choose to do three memory steps. We store only those errors 
for later relearning whose learning has reduced the convergence time appreciably. 

7. Results for the ordered case 

As a first test we investigated the neural multigrid for the standard Laplace equation 
in two dimensions on a square lattice. We use periodic boundary conditions and 
add a small mass term: 

(A + m 2 )£ = f . (13) 

It is well known that in this case a standard multigrid method will exhibit excel- 
lent convergence properties, reducing the error at least by a factor of ten in each 
multigrid cyclctm. 

We studied the behaviour of our neural multigrid on lattices of size 16 2 up to 
64 2 . The coarsest layer was always chosen to have a size of 2 2 , so that the scaling 
behaviour of the algorithm could be studied. Remember that this is one of the 
key questions for this method: How much do the additional learning steps on the 
coarser layers affect the overall work done by the algorithm? 

For a simple test case in one dimension it was found that the situation is not 
favorable: The larger the lattice became, the greater was the amount of work needed 
to learn good interpolation operators. We estimated a critical exponent of 0.5. 
However, the larger the dimension, the smaller is the relative size of the block 
lattices compared to the finest lattice, so in higher dimensions the situation might 
get better. 

Figure [j] shows the work needed to learn interpolation operators that achieve the 
textbook efficiency of a reduction factor of at least 10 in each multigrid cycle for the 
two-dimensional case. The work was measured in elementary vector operations, i.e. 
each vector addition, multiplication on the finest layer etc. counts as one work unit, 
a vector operation on the first block layer as 1/4 work units and so on. For each set 
of parameters ten runs have been made. (Note that although the problem operator 
is always the same, the righthandside and the randomly chosen start vectors differ 
on each run.) 

We can see two things: The number of work units needed does not grow with 
the lattice size. This is an encouraging result because it means that the recursive 
structure of the method does not lead to an impractical algorithm. 

On the other hand we can see that the needed work does increase with the 
criticality of the problem. For mass values smaller than 10 -4 the work grows ap- 
preciably when we lower the mass further. We see that the work needed at masses 
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Figure 1: Number of Work units (elementary vector operations; see text) needed 
by the neural multigrid to learn interpolation operators with a reduction rate of 
smaller than 0.1 and to solve the simple Laplace equation. For each data point, ten 
runs were made. The work does not depend on the lattice size, but for very small 
eigenvalues the needed work increases. (At 10~ 6 one of the runs on the 16 2 -latticc 
did not converge, this was not taken into account here, see text.) Note the offset of 
the y-axis. 
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Figure 2: Reduction factor p achieved by the neural multigrid after one learning 
step on the hnest lattice has been completed. Again there is no dependence on the 
lattice size, but for very small mass values of the Laplace equation the reduction 
factor is gets large. 

10~ 5 is approximately twice that we need at mass 10 -2 ; out of this we would es- 
timate a critical exponent of 0.2. (The critical exponent z for fixed lattice size is 
given by (Work needed) cx k z / 2 , where k is the condition number.) This is a fairly 
small value (e.g. when compared to the value for Conjugate Gradient algorithms 
with 2 = 1). Of course, with the present data we cannot exclude that the work 
needed will grow faster when we decrease the mass even further. 

At a mass value of 10 -6 on a 16 2 -lattice, the neural multigrid did not reach the 
required convergence rate within 20 learning cycles in one of the ten runs. However, 
it did achieve a reduction factor of 7 within a few iterations. This run has not been 
taken into account in the picture, but it shows again that the algorithm has greater 
difficulties at very small eigenvalues. 

We can also look at the reduction rate p that is achieved after the first learning 
step on the finest layer is finished, see figure |[ Again we see that for masses of 10 -4 
and larger the results are very good and textbook efficiency is usually achieved; for 
smaller mass values the convergence deteriorates. 

Although the results are not too bad, we can overcome this deterioration by 
adding an improvement as explained in section 6.1. The application of the principle 
of indirect elimination will eliminate the contribution from the lowest eigenmode of 
the problem operator, which is the one causing the problems here. After adding such 
an updating step the convergence stays constant regardless of the mass value. Note, 
however, that this is only possible because the number of low-lying eigenmodes is 
small in the case of the Laplace equation. 
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Wc believe that the results shown in this section are quite encouraging: There is 
no dependence of the work needed on the lattice size. Furthermore, except for very 
high criticality the absolute number of work units is not too high. For comparison, 
a Conjugate Gradient algorithm needs about 3700 work units to solve the equation 
with the desired accuracy on the 64 2 -lattice at a mass value of 10~ 4 . (Note that 
the work for the final solution of the equation is included in figure g.) This is about 
a factor of two smaller that the work needed by the neural multigrid; however we 
have not tuned the parameters to minimize the work; for instance the number of 
multigrid sweeps done in each measurement has been chosen to be 20, which is quite 
large. In addition, during the program runs several test measurements of errors, 
residuals etc. are done that are not strictly necessary. Therefore we believe that 
the neural multigrid still has a great potential for improvements. The real test is 
of course its behaviour in the case of disordered problems. 

8. Lattice Gauge Theory 
8.1. The Problem 



As an example for a disordered system we consider the propagator equation for 
a bosonic particle in an SU(2)-gauge field background!! The covariant Laplace- 
operator in stencil-notation is 



A(z) 



U 



U z ,2 

4 



U 



2,-2 





u z>x 





(14) 



with U z ,n € SU(2). The second index denotes the direction of the coupling to the 



neighbour. The link matrices 
according to the Wilson action 

Sw - 



fj. fulfill U,,-^ 



U* 



^Retr (1 - Up) 



They are distributed 



(15) 



Here (3 = A/g 2 is the inverse coupling and the sum is over all Plaquettes in the 
lattice. Up denotes the parallel transport around a Plaquette. This distribution 
leads to a correlation between the gauge field matrices with finite correlation length 
X for finite (3. The case (3 = corresponds to a completely random choice of the 
matrices (x = 0), for (3 = oo all matrices are 1 (x = oc). In this sense, (3 is a 
disorder parameter, the smaller (3 the shorter the correlation length and the larger 
the disorder. 

Here we are only interested in the behaviour of our algorithm for a disordered 
system and so we will choose in the following (3 = 0, which gives the greatest 
disorder possible. 

As it stands, however, the equation is not critical: The lowest eigenvalue will 
be quite large (of the order of 0.5). To get a disordered critical problem we first 
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Figure 3: Number of Work units needed by the neural multigrid to learn interpo- 
lation operators with a reduction rate of smaller than 0.6 and to solve the SU(2) 
Laplace equation. For each data point, ten runs were made, except for the 32 2 and 
64 2 lattices at 8m 2 — 10 -4 , where 20 and 15 runs were done. 

calculate the lowest eigenvalue and then subtract it from the diagonal part of the 
operator. This allows us to tune criticality and thereby to measure the conver- 
gence behaviour accurately. Note that this destroys the diagonal dominance of the 
operator and makes the problem quite difficult for a multigrid method. 

8.2. Results 

Trying the neural multigrid on the described problem, we found quickly that it was 
not possible to achieve the desired textbook efficiency of p — O.f . A more realistic 
goal seemed to require the neural multigrid to reach a p < 0.6, which means that 
for the same error reduction we need about four times the work. But if this could 
be reached regardless of criticality and lattice size, the algorithm would still be very 
efficient. 

To improve the convergence, we introduced the improvements described in sec- 
tion 6. However, as a look at fig. || shows, the work needed to learn good interpo- 
lation operators that achieve the desired reduction rate grows with the criticality 
and at first also with the lattice size. The growth with the lattice size stops as 
soon as the 32 2 -lattice is reached, so for large enough lattices there seems to be no 
critical slowing down here. For the growth with the criticality the exponent can be 
estimated roughly to be about z w 0.4. Although this is not too bad, the absolute 
number of work units is quite large; about f 5 times larger than that needed by a 
Conjugate Gradient algorithm. 
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These results are somewhat disappointing. As it stands, the algorithm is not as 
efficient as we hoped for. Nevertheless, we have achieved an important result: The 
algorithm was able to find interpolation operators that allow for multigrid cycles 
with p < 0.6 in almost all cases. This means that we have shown numerically that 
there exists a practical true multigrid algorithm for a two-dimensional bosonic gauge 
theory without critical slowing down. Up to now it was only known that idealized 
multigrid algorithms (in four dimensions) were able to eliminate critical slowing 
down, but these used non-local interpolation operatorstil. The ISU algorithmic also 
eliminates critical slowing down for our test problem, but being a unigrid the work 
it needs grows as In 2 (Volume). By investigating the interpolation operators found 
by the neural multigrid more closely we might find valuable informations about how 
good interpolation operators should look like. The same can also be tried for the 
case of greater interest, namely the four-dimensional Dirac equation. 

8.3. Overcoming the difficulties 

Is it possible to overcome the described difficulties? To answer this, we have to 
investigate the reason for the problems with the gauge theory. A first hint is that 
most of the work was done on the coarser layers. This means that as a twogrid, 
our algorithm would have (nearly) no critical slowing down. Are the coarser layers 
more problematic than the finest one? 

Indeed they are. As the interpolation operators are not simple linear functions, 
the effective operator D fe has strongly fluctuating couplings and a fluctuating diago- 
nal term as well. A simple look at the operators shows that after two blocking steps 
the coupling strengths may vary by a factor of ten or more. It is well known that 
for couplings with such strong fluctuations a simple blocking scheme with square 
blocks will not be efficient. 

In order to overcome this difficulty, an algorithm to determine good shapes 
for the supports of the interpolation operators is needed. Such an algorithm was 
developed as a part of the Algebraic MultigricO and could be used as an ingredient 
to our neural multigrid. This algorithm chooses those points as block-centers that 
have many strong connections to other points of the lattice and is quite efficient. 

A preliminary study to confirm this picture was also done: We used the scalar 
Laplace equation with a site-dependent mass term. At low criticality, the neural 
multigrid again exhibited critical slowing down when the mass term was strongly 
fluctuating. A study of the errors showed that indeed these are the points where 
the error is not properly reduced. This problem was partly alleviated by shifting 
the lattice such that the points with the weakest connections were not chosen as 
block-centers, however some critical slowing down still remained. 

So it is probable that using dynamically chosen blocks the algorithm would 
perform much better, perhaps even without critical slowing down. 

9. Conclusions 



Another Look at Neural Multigrid 13 



We have presented a new multigrid method called neural multigrid which is based 
on joining multigrid ideas with concepts from neural nets. The algorithm is able to 
learn efficient interpolation operators in the case of the ordered Laplace equation 
nearly without critical slowing down and with a surprisingly small amount of work 
comparable to that of a Conjugate Gradient solver. 

In the case of a disordered system (the Laplace equation with SU(2) gauge fields) 
the learning exhibited critical slowing down with an exponent of about z « 0.4 and 
the algorithm was able to find good interpolation operators in this case as well. 
Finally it was shown that the remaining critical slowing down of the algorithm might 
be overcome by choosing the supports of the interpolation operators dynamically. 
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